Data cleaning example project

Fri 06 March 2020

The data used in this example project is the titanic dataset which is a very common and widely used dataset and therefore also quite uninteresting as subject to an extensive data analysis or machine learning. But the version of the dataset that I have required suits the purpose of illustrating my data cleaning capabilities very well because it consists of different parts that have to be put together and contains problematic values of different kinds which have to be handled.

Imports and display setup:

In [19]:
import pandas as pd

Connecting the different parts of the dataset

Connecting part 1 and part 2:

In [20]:
part_1_df = pd.read_excel("./data/datasæt del1.xlsx")
part_2_df = pd.read_csv("./data/datasæt del2 komma.csv")

titanic_df = pd.concat([part_1_df, part_2_df])
titanic_df.reset_index(drop=True, inplace=True)
titanic_df.tail()
Out[20]:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest case_id
1308 3 0.0 female NaN 0 0 65305 8.1125 NaN S NaN NaN NaN 1308
1309 3 0.0 male 0.75 1 1 SOTON/O.Q. 3101315 13.775 NaN S NaN NaN NaN 1309
1310 3 1.0 male 25.00 0 0 345768 9.5 NaN S 11 NaN NaN 1310
1311 3 1.0 male 14.00 0 0 7538 9.225 NaN S 13 NaN NaN 1311
1312 3 0.0 male 25.00 1 0 347076 7.775 NaN S NaN NaN NaN 1312

Adding names to the dataframe:

In [21]:
names_txt = pd.read_csv('./data/navne.txt', sep=",", header=None)
names_column = names_txt.loc[1:,1].reset_index(drop=True)

titanic_df['name'] = names_column
titanic_df.head()
Out[21]:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest case_id name
0 1 0.0 male 67.0 1 0 PC 17483 221.779 C55 C57 S NaN 96.0 New York, NY 1 Straus, Mr. Isidor
1 1 0.0 male 46.0 0 0 694 26 NaN S NaN 80.0 Bennington, VT 2 Jones, Mr. Charles Cresson
2 1 0.0 male 50.0 1 0 PC 17761 106.425 C86 C NaN 62.0 Deephaven, MN / Cedar Rapids, IA 3 Douglas, Mr. Walter Donald
3 1 0.0 male 61.0 0 0 36963 32.3208 D50 S NaN 46.0 Haddenfield, NJ 4 Sutton, Mr. Frederick
4 1 0.0 male 32.5 0 0 113503 211.5 C132 C NaN 45.0 NaN 5 Keeping, Mr. Edwin

Setting case_id as the index of the dataframe:

In [22]:
titanic_df.set_index('case_id', inplace=True)

Data cleaning

Removing duplicates in the dataset:

In [23]:
print('Locating duplicates:')
display(titanic_df[titanic_df.duplicated()])
print('\n Duplicates before:')
display(titanic_df.loc[550:551])
titanic_df = titanic_df.drop_duplicates()
# Showing the duplicate
print('\n Duplicates after:')
display(titanic_df.loc[550:551])
Locating duplicates:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
550 2 0.0 male NaN 0 0 239853 0 NaN S NaN NaN Belfast Cunningham, Mr. Alfred Fleming
 Duplicates before:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
550 2 0.0 male NaN 0 0 239853 0 NaN S NaN NaN Belfast Cunningham, Mr. Alfred Fleming
550 2 0.0 male NaN 0 0 239853 0 NaN S NaN NaN Belfast Cunningham, Mr. Alfred Fleming
551 2 0.0 male 35.0 0 0 233734 12.35 NaN Q NaN NaN NaN Keane, Mr. Daniel
 Duplicates after:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
550 2 0.0 male NaN 0 0 239853 0 NaN S NaN NaN Belfast Cunningham, Mr. Alfred Fleming
551 2 0.0 male 35.0 0 0 233734 12.35 NaN Q NaN NaN NaN Keane, Mr. Daniel

Removing row which contains too high fare value:

In [24]:
# Change the type in the fare column to float 
titanic_df['fare'].replace('.', 0.0, inplace=True)
numeric_fare = pd.to_numeric(titanic_df['fare'])
titanic_df.loc[:, 'fare'] = numeric_fare

# Finding the row with the maximum fare value and removing it from the dataframe
print('Max fare value before:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])
titanic_df.drop(60, axis=0, inplace=True)
print('\n Max fare value after:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])
Max fare value before:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
60 1 1.0 female 17.0 0 1 12345 25000.0 NaN S NaN NaN Philadelphia, PA Bukater, Miss. Rose Dewitt
 Max fare value after:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
150 1 1.0 male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C 3 NaN Austria-Hungary / Germantown, Philadelphia, PA Cardeza, Mr. Thomas Drake Martinez
199 1 1.0 male 35.0 0 0 PC 17755 512.3292 B101 C 3 NaN NaN Lesurer, Mr. Gustave J
267 1 1.0 female 35.0 0 0 PC 17755 512.3292 NaN C 3 NaN NaN Ward, Miss. Anna
296 1 1.0 female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C 3 NaN Germantown, Philadelphia, PA Cardeza, Mrs. James Warburton Martinez (Charlo...

Removing datasource where the death is hard to determine:

In [25]:
# Locate values which are different than 0 and 1 in the survived column
print('Values different than 0 or 1 in survived column before:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])
titanic_df.drop(325, axis=0, inplace=True)
print('\n Values different than 0 or 1 in survived column after:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])
Values different than 0 or 1 in survived column before:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
325 1 NaN male 68.0 0 0 36010 10.0 NaN S NaN NaN New York City, NY Wonder, Mr. Stevie
 Values different than 0 or 1 in survived column after:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id

Removing data scource where age is below 0:

In [26]:
print('Value in age column below 0 before:')
display(titanic_df.loc[titanic_df['age'] < 0])
titanic_df.drop(172, axis=0, inplace=True)
print('\n Value in age column below 0 after:')
display(titanic_df.loc[titanic_df['age'] < 0])
Value in age column below 0 before:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id
172 1 1.0 male -62.0 0 0 24161 10.0 Shutter Island S NaN NaN Los Angeles, CA DiCaprio, Mr. Leonardo
 Value in age column below 0 after:
pclass survived sex age sibsp parch ticket fare cabin embarked boat body homedest name
case_id

Further cleaning of data

Overview of the cleaned dataframe as it is now:

In [27]:
print('Summary info about the dataframe:\n')
display(titanic_df.info())
print('\n Data types:')
display(titanic_df.dtypes)
Summary info about the dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1312
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   float64
 2   sex       1309 non-null   object 
 3   age       1046 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   ticket    1309 non-null   object 
 7   fare      1309 non-null   float64
 8   cabin     295 non-null    object 
 9   embarked  1307 non-null   object 
 10  boat      486 non-null    object 
 11  body      121 non-null    float64
 12  homedest  745 non-null    object 
 13  name      1309 non-null   object 
dtypes: float64(4), int64(3), object(7)
memory usage: 153.4+ KB
None
 Data types:
pclass        int64
survived    float64
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
boat         object
body        float64
homedest     object
name         object
dtype: object

Turning sibsp (sibling/spouse) and parch (parent/child) columns into one column called family and removing the individual columns:

In [28]:
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1
titanic_df = titanic_df.drop('sibsp', axis=1)
titanic_df = titanic_df.drop('parch', axis=1)
display(titanic_df.head())
pclass survived sex age ticket fare cabin embarked boat body homedest name family_size
case_id
1 1 0.0 male 67.0 PC 17483 221.7792 C55 C57 S NaN 96.0 New York, NY Straus, Mr. Isidor 2
2 1 0.0 male 46.0 694 26.0000 NaN S NaN 80.0 Bennington, VT Jones, Mr. Charles Cresson 1
3 1 0.0 male 50.0 PC 17761 106.4250 C86 C NaN 62.0 Deephaven, MN / Cedar Rapids, IA Douglas, Mr. Walter Donald 2
4 1 0.0 male 61.0 36963 32.3208 D50 S NaN 46.0 Haddenfield, NJ Sutton, Mr. Frederick 1
5 1 0.0 male 32.5 113503 211.5000 C132 C NaN 45.0 NaN Keeping, Mr. Edwin 1

Locating NaN values in the dataset:

In [29]:
titanic_df.isnull().sum()
Out[29]:
pclass            0
survived          0
sex               0
age             263
ticket            0
fare              0
cabin          1014
embarked          2
boat            823
body           1188
homedest        564
name              0
family_size       0
dtype: int64

Removing the cabin, boat, and body columns because it contains way too many NaN values:

In [30]:
titanic_df.drop(columns='cabin', inplace=True)
titanic_df.drop(columns='boat', inplace=True)
titanic_df.drop(columns='body', inplace=True)

Replacing NaN values in age col with median values:

In [31]:
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

Replacing NaN values in embarked col with embarked value with highest frequency:

In [32]:
display(titanic_df['embarked'].mode())
titanic_df['embarked'] = titanic_df['embarked'].fillna('S')
0    S
dtype: object

Creating adult col to indicate if the person is an adult or a child:

In [33]:
titanic_df['adult'] = 0
titanic_df['adult'][titanic_df['age'] >= 18] = 1
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

Renaming pclass column to something that is more interpretable:

In [34]:
titanic_df.rename(columns={'pclass': 'price_class'}, inplace=True)

The cleaned dataset now looks like this:

In [35]:
titanic_df.head()
Out[35]:
price_class survived sex age ticket fare embarked homedest name family_size adult
case_id
1 1 0.0 male 67.0 PC 17483 221.7792 S New York, NY Straus, Mr. Isidor 2 1
2 1 0.0 male 46.0 694 26.0000 S Bennington, VT Jones, Mr. Charles Cresson 1 1
3 1 0.0 male 50.0 PC 17761 106.4250 C Deephaven, MN / Cedar Rapids, IA Douglas, Mr. Walter Donald 2 1
4 1 0.0 male 61.0 36963 32.3208 S Haddenfield, NJ Sutton, Mr. Frederick 1 1
5 1 0.0 male 32.5 113503 211.5000 C NaN Keeping, Mr. Edwin 1 1